Performance optimized approach for efficient downsampling operations

ABSTRACT

The present invention provides an algorithm and hardware structure for numerical operations on signals that is reconfigurable to operate in a downsampling or non-downsampling mode. According to one embodiment, a plurality of adders and multipliers are reconfigurable via a switching fabric to operate as a plurality of MAAC kernels (described in detail below), when operating in a non-downsampling mode and a plurality of MAAC kernels and AMAAC kernels (described in detail below), when operating in a downsampling mode.

FIELD OF THE INVENTION

[0001] The present invention relates to the areas of computation and algorithms, and specifically to digital signal processing (“DSP”) and to digital logic, algorithms, structures and systems for performing DSP operations. In particular, the present invention relates to a reconfigurable system for providing non-downsampling and downsampling operations on a signal.

BACKGROUND INFORMATION

[0002] Digital signal processing (“DSP”) and information theory technology is essential to modern information processing and to telecommunications for both the efficient storage and transmission of data. In particular, effective multimedia communications including speech, audio and video rely on efficient methods and structures for compression of the multimedia data in order to conserve bandwidth on the transmission channel as well as to reduce storage requirements.

[0003] Many DSP algorithms rely on transform kernels such as an FFT (“Fast Fourier Transform”), DCT (“Discrete Cosine Transform”), etc. For example, the discrete cosine transform (“DCT”) has become a very widely used component in performing compression of multimedia information, in particular video information. The DCT is a loss-less mathematical transformation that converts a spatial or time representation of a signal into a frequency representation. The DCT offers attractive properties for converting between spatial/time domain and frequency representations of signals as opposed to other transforms such as the DFT (“Discrete Fourier Transform”)/FFT. In particular, the kernel of the transform is real, reducing the complexity of processor calculations that must be performed. In addition, a significant advantage of the DCT for compression is that it exhibits an energy compaction property, wherein the signal energy in the transform domain is concentrated in low frequency components, while higher frequency components are typically much smaller in magnitude, and may often be discarded. The DCT is in fact asymptotic to the statistically optimal Karhunen-Loeve transform (“KLT”) for Markov signals of all orders. Since its introduction in 1974, the DCT has been used in many applications such as filtering, transmultiplexers, speech coding, image coding (still frame, video and image storage), pattern recognition, image enhancement and SAR/IR image coding. The DCT has played an important role in commercial applications involving DSP; most notably, it has been adopted by MPEG (“Moving Picture Experts Group”) for use in the MPEG-2 and MPEG-4 video compression algorithms.

[0004] A computation that is common in digital filters such as finite impulse response (“FIR”) filters or linear transformations such as the DFT and DCT may be expressed mathematically by the following dot-product equation: $d = \sum\limits_{i = 0}^{N - 1} a(i)*b(i)$

[0005] where a(i) are the input data, b(i) are the filter coefficients (taps) and d is the output. Typically a multiply-accumulator (“MAC”) is employed in traditional DSP design in order to accelerate this type of computation. A MAC kernel can be described by the following equation:

$d^{[i+1]} = d^{[i]} + a(i)*b(i)$ with initial value $d^{[0]} = 0$.
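By way of illustration only, the MAC recursion may be sketched in Python as follows. The function name `mac_dot` and the sample values are assumptions introduced here and do not appear in the figures; each loop iteration stands in for one clock cycle of the kernel.

```python
def mac_dot(a, b):
    """Accumulate d[i+1] = d[i] + a(i)*b(i), starting from d[0] = 0."""
    d = 0
    for a_i, b_i in zip(a, b):
        d = d + a_i * b_i  # one multiply and one accumulate per cycle
    return d


a = [1, 2, 3, 4]  # input data a(i), illustrative values
b = [5, 6, 7, 8]  # filter coefficients b(i), illustrative values
result = mac_dot(a, b)  # 1*5 + 2*6 + 3*7 + 4*8
```

An N-term dot product therefore occupies the kernel for N cycles, which is the baseline against which the kernels described later are compared.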

[0006] In some cases it is advantageous to downsample a signal. For example, with images, it is often advantageous to view an image in a smaller frame. However, the algorithms for generating a downsampled signal and a non-downsampled signal typically differ significantly. Thus, separate hardware structures are typically required to generate either a downsampled signal or a non-downsampled signal. This is highly disadvantageous as it results in increased hardware area, complexity and cost. Thus, it would be advantageous to develop a hardware structure capable of operating in either a downsampling or a non-downsampling mode, while reducing the redundancy of hardware elements as much as possible.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] FIG. 1a is a block diagram of a video encoding system.

[0008] FIG. 1b is a block diagram of a video decoding system.

[0009] FIG. 2 is a block diagram of a datapath for computing a 2-D IDCT.

[0010] FIG. 3 is a block diagram illustrating the operation of a MAC kernel.

[0011] FIG. 4 is a block diagram illustrating the operation of a MAAC kernel according to one embodiment of the present invention.

[0012] FIG. 5 illustrates a paradigm for improving computational processes utilizing a MAAC kernel according to one embodiment of the present invention.

[0013] FIG. 6 is a block diagram of a hardware architecture for computing an eight-point IDCT utilizing a MAAC kernel according to one embodiment of the present invention.

[0014] FIG. 7 is a block diagram of a datapath for computation of an 8-point IDCT utilizing the method of the present invention and a number of MAAC kernel components according to one embodiment of the present invention.

[0015] FIG. 8 is a block diagram illustrating the operation of an AMAAC kernel according to one embodiment of the present invention.

[0016] FIG. 9 is a block diagram of a reconfigurable downsampling computation engine according to one embodiment of the present invention.

[0017] FIG. 10 is a block diagram of a hardware architecture for computing an eight-point IDCT in a 2:1 downsampling mode according to one embodiment of the present invention.

[0018] FIG. 11a is a block diagram illustrating a datapath for computing a first path of an eight-point 2-D IDCT in a non-downsampling mode according to one embodiment of the present invention.

[0019] FIG. 11b is a block diagram illustrating a datapath for computing a second path of an eight-point 2-D IDCT in a non-downsampling mode according to one embodiment of the present invention.

[0020] FIG. 12a is a block diagram illustrating a datapath for computing a first path of an eight-point to four-point 2-D IDCT in a downsampling mode according to one embodiment of the present invention.

[0021] FIG. 12b is a block diagram illustrating a datapath for computing a second path of an eight-point to four-point 2-D IDCT in a downsampling mode according to one embodiment of the present invention.

DETAILED DESCRIPTION

[0022] The present invention provides an algorithm and hardware structure for numerical operations on signals that is reconfigurable to operate in a downsampling or non-downsampling mode. According to one embodiment, a plurality of adders and multipliers are reconfigurable via a switching fabric to operate as a plurality of MAAC kernels (described in detail below), when operating in a non-downsampling mode, and a plurality of MAAC kernels and AMAAC kernels (described in detail below), when operating in a downsampling mode. According to one embodiment, the downsampling and non-downsampling operations are performed as part of an IDCT process.

[0023] FIG. 1a is a block diagram of a video encoding system. Video encoding system 123 includes DCT block 111, quantization block 113, inverse quantization block 115, IDCT block 140, motion compensation block 150, frame memory block 160, motion estimation block 121 and VLC (“Variable Length Coder”) block 131. Input video is received in digitized form. Together with one or more reference video data from frame memory, input video is provided to motion estimation block 121, where a motion estimation process is performed. The output of motion estimation block 121, containing motion information such as motion vectors, is transferred to motion compensation block 150 and VLC block 131. Using motion vectors and one or more reference video data, motion compensation block 150 performs a motion compensation process to generate motion prediction results. The motion prediction results from motion compensation block 150 are subtracted from the input video at adder 170a.

[0024] The output of adder 170a is provided to DCT block 111, where a DCT is computed. The output of the DCT is provided to quantization block 113, where the frequency coefficients are quantized and then transmitted to VLC block 131, where a variable length coding process (e.g., Huffman coding) is performed. Motion information from motion estimation block 121 and quantized indices of DCT coefficients from quantization block 113 are provided to VLC block 131. The output of VLC block 131 is the compressed video data output from video encoder 123. The output of quantization block 113 is also transmitted to inverse quantization block 115, where an inverse quantization process is performed.

[0025] The output of inverse quantization block 115 is provided to IDCT block 140, where an IDCT is performed. The output of IDCT block 140 is summed at adder 170b with motion prediction results from motion compensation block 150. The output of adder 170b is reconstructed video data and is stored in frame memory block 160 to serve as reference data for the encoding of future video data.

[0026] FIG. 1b is a block diagram of a video decoding system. Video decoding system 125 includes variable length decoder (“VLD”) block 110, inverse scan (“IS”) block 120, inverse quantization (“IQ”) block 130, IDCT block 140, frame memory block 160, motion compensation block 150 and adder 170. A compressed video bitstream is received by VLD block 110 and decoded. The decoded symbols are converted into quantized indices of DCT coefficients and their associated sequential locations in a particular scanning order. The sequential locations are then converted into frequency-domain locations by the IS block 120. The quantized indices of DCT coefficients are converted to DCT coefficients by the IQ block 130. The DCT coefficients are received by IDCT block 140 and transformed. The output from the IDCT is then combined with the output of motion compensation block 150 by the adder 170. The motion compensation block 150 may reconstruct individual pictures based upon the changes from one picture to its reference picture(s). Data from the reference picture(s), a previous one or a future one or both, may be stored in a temporary frame memory block 160 such as a frame buffer and may be used as the references. The motion compensation block 150 uses the motion vectors decoded from the VLD 110 to determine how the current picture in the sequence changes from the reference picture(s). The output of the motion compensation block 150 is the motion prediction data. The motion prediction data is added to the output of the IDCT 140 by the adder 170. The output from the adder 170 is then clipped (not shown) to become the reconstructed video data.

[0027] FIG. 2 is a block diagram of a datapath for computing a 2-dimensional (2-D) IDCT according to one embodiment of the present invention. It includes a data multiplexer 205, a 1-D IDCT block 210, a data demultiplexer 207 and a transport storage unit 220. Incoming data from IQ is processed in two passes through the IDCT. In the first pass, the IDCT block is configured to perform a 1-D IDCT transform along the vertical direction. In this pass, data from IQ is selected by multiplexer 205 and processed by the 1-D IDCT block 210. The output from IDCT block 210 consists of intermediate results, which are routed by demultiplexer 207 to be stored in the transport storage unit 220. In the second pass, IDCT block 210 is configured to perform a 1-D IDCT along the horizontal direction. As such, the intermediate data stored in the transport storage unit 220 is selected by multiplexer 205 and processed by the 1-D IDCT block 210. Demultiplexer 207 outputs the results from the 1-D IDCT block as the final result of the 2-D IDCT.
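The two-pass flow may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the 1-D block is modeled as a direct orthonormal IDCT, an in-memory transpose stands in for storage unit 220, and the function names `idct_1d`, `transpose` and `idct_2d` are introduced here for illustration only.

```python
import math


def idct_1d(y):
    """Direct orthonormal 1-D IDCT of a list of N coefficients."""
    n = len(y)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            a_k = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += math.sqrt(2 / n) * a_k * y[k] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
        out.append(acc)
    return out


def transpose(m):
    """Swap rows and columns of a matrix held as nested lists."""
    return [list(col) for col in zip(*m)]


def idct_2d(y):
    # Pass 1: the 1-D block runs along the vertical direction (each column).
    intermediate = transpose([idct_1d(col) for col in transpose(y)])
    # Pass 2: the same 1-D block runs along the horizontal direction (each row).
    return [idct_1d(row) for row in intermediate]
```

Because the same `idct_1d` routine serves both passes, only one 1-D transform unit is needed, which mirrors the reuse of block 210 in the datapath.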

[0028] Many computational processes such as the transforms described above (e.g., DCT, IDCT, DFT, etc.) and filtering operations rely upon a multiply and accumulate kernel. That is, the algorithms are effectively performed utilizing one or more multiply and accumulate components typically implemented as specialized hardware on a DSP or other computer chip. The commonality of the MAC nature of these processes has resulted in the development of particular digital logic and circuit structures to carry out multiply and accumulate processes. In particular, a fundamental component of any DSP chip today is the MAC unit.

[0029] FIG. 3 is a block diagram illustrating the operation of a MAC kernel. Multiplier 310 performs multiplication of input datum a(i) and filter coefficient b(i), the result of which is passed to adder 320. Adder 320 adds the result of multiplier 310 to accumulated output $d^{[i]}$, which was previously stored in register 330. The output of adder 320 ($d^{[i+1]}$) is then stored in register 330. Typically a MAC output is generated on each clock cycle.

[0030] FIG. 4 is a block diagram illustrating the operation of a MAAC kernel according to one embodiment of the present invention. The MAAC kernel can be described by the following recursive equation:

$d^{[i+1]} = d^{[i]} + a(i)*b(i) + c(i)$ with initial value $d^{[0]} = 0$.

[0031] MAAC kernel 405 includes multiplier 310, adder 320 and register 330. Multiplier 310 performs multiplication of input datum a(i) and filter coefficient b(i), the result of which is passed to adder 320. Adder 320 adds the result of multiplier 310 to a second input term c(i) along with accumulated output $d^{[i]}$, which was previously stored in register 330. The output of adder 320 ($d^{[i+1]}$) is then stored in register 330.

[0032] As an additional addition (c(i)) is performed each cycle, the MAAC kernel will have higher performance throughput for some classes of computations. For example, the throughput of a digital filter with some filter coefficients equal to one can be improved utilizing the MAAC architecture depicted in FIG. 4.
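The throughput gain may be illustrated with a short Python sketch. The function name `maac_accumulate`, the four-tap filter and its input values are hypothetical and chosen only to show unity-coefficient inputs being routed through the c(i) port.

```python
def maac_accumulate(a, b, c):
    """Run d[i+1] = d[i] + a(i)*b(i) + c(i) from d[0] = 0 and return the final d."""
    d = 0
    for a_i, b_i, c_i in zip(a, b, c):
        d = d + a_i * b_i + c_i  # one multiply and two additions per cycle
    return d


# Hypothetical filter taps [3, 1, 5, 1] applied to inputs [2, 7, 4, 9].
# The inputs with unity taps (7 and 9) bypass the multiplier via c(i),
# so the four-term sum is finished in two MAAC cycles instead of four MAC cycles.
d = maac_accumulate(a=[2, 4], b=[3, 5], c=[7, 9])
```

The saving depends entirely on how many coefficients are unity; a filter with no unity taps gains nothing from the extra adder input.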

[0033] FIG. 5 illustrates a paradigm for improving computational processes utilizing a MAAC kernel according to one embodiment of the present invention. In step 510, an expression for a particular computation is determined. Typically, the computation is expressed as a linear combination of input elements a(i), each scaled by a respective coefficient b(i). That is, the present invention provides for improved efficiency of performance for computational problems that may be expressed in the general form: $d = \sum\limits_{i = 0}^{N - 1} a(i)*b(i)$

[0034] where a(i) are the input data, b(i) are coefficients and d is the output. As noted above, utilizing a traditional MAC architecture, output d may be computed utilizing a kernel of the form:

$d^{[i+1]} = d^{[i]} + a(i)*b(i)$ with initial value $d^{[0]} = 0$.

[0035] This type of computation occurs very frequently in many applications including digital signal processing, digital filtering, etc.

[0036] In step 520, a common factor ‘c’ is factored out of the expression, obtaining the following expression: $d = c\sum\limits_{i = 0}^{N - 1} a(i)*b'(i)$ where $b(i) = c\,b'(i)$.

[0037] If, as a result of factoring out the common factor c, some of the coefficients b′(i) are unity, then the following result is obtained: $d = c\left(\sum\limits_{i = 0}^{M - 1} a(i)*b'(i) + \sum\limits_{i = M}^{N - 1} a(i)\right)$ where $\{b'(i) = 1 : M \leq i \leq N - 1\}$

[0038] This may be effected, for example, by factoring a matrix expression such that certain matrix entries are ‘1’. The above expression lends itself to use of the MAAC kernel described above by the recursive equation:

$d^{[i+1]} = d^{[i]} + a(i)*b(i) + c(i)$ with initial value $d^{[0]} = 0$.

[0039] In this form the computation utilizes at least one addition per cycle due to the unity coefficients.

[0040] In step 530, based upon the re-expression of the computational process accomplished in step 520, one or more MAAC kernels are arranged in a configuration to carry out the computational process as represented in its re-expressed form.
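The three steps of FIG. 5 may be sketched in Python as follows. This is an illustrative sketch only: the function name `factored_dot`, the parameter name `common` (standing for the common factor c) and the sample coefficients are assumptions, and the cycle count models a single MAAC kernel.

```python
def factored_dot(a, b, common):
    """Evaluate sum a(i)*b(i) as common * (products with b'(i) != 1 + plain sum where b'(i) == 1)."""
    b_prime = [b_i / common for b_i in b]                        # step 520: b(i) = common * b'(i)
    mult_terms = [(a_i, bp) for a_i, bp in zip(a, b_prime) if bp != 1]
    add_terms = [a_i for a_i, bp in zip(a, b_prime) if bp == 1]  # unity coefficients
    # Step 530: pair one product with one addend per MAAC cycle, padding with zeros.
    cycles = max(len(mult_terms), len(add_terms))
    mult_terms += [(0, 0)] * (cycles - len(mult_terms))
    add_terms += [0] * (cycles - len(add_terms))
    d = 0
    for (a_i, bp), c_i in zip(mult_terms, add_terms):
        d = d + a_i * bp + c_i                                   # MAAC recursion
    return common * d, cycles


# Hypothetical coefficients [6, 4, 2, 2] with common factor 2 give b' = [3, 2, 1, 1].
value, cycles = factored_dot(a=[1, 2, 3, 4], b=[6, 4, 2, 2], common=2)
```

In this example the four-term sum is completed in two cycles, since two of the factored coefficients are unity and their inputs travel through the addition port.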

[0041] The paradigm depicted in FIG. 5 is particularly useful for multiply and accumulate computational processes. According to one embodiment, described herein, the method of the present invention is applied to provide a more efficient IDCT computation, which is a multiply and accumulate process typically carried out using a plurality of MAC kernels.

[0042] According to one embodiment, the present invention is applied to the IDCT in order to reduce computational complexity and improve efficiency. In particular, the number of clock cycles required in a particular hardware implementation to carry out the IDCT is reduced significantly.

[0043] The 2-D DCT may be expressed as follows: $y_{kl} = \sqrt{\frac{2}{M}}\,a(k)\sqrt{\frac{2}{N}}\,a(l)\sum\limits_{i = 0}^{M - 1}\sum\limits_{j = 0}^{N - 1} x_{ij}\cos\left(\frac{(2i + 1)k\pi}{2M}\right)\cos\left(\frac{(2j + 1)l\pi}{2N}\right)$ where $a(k) = \begin{cases}\frac{1}{\sqrt{2}}, & \text{if } k = 0 \\ 1, & \text{otherwise}\end{cases}$

[0044] The 2-D IDCT may be expressed as follows: $x_{ij} = \sum\limits_{k = 0}^{M - 1}\sum\limits_{l = 0}^{N - 1} y_{kl}\sqrt{\frac{2}{M}}\,a(k)\sqrt{\frac{2}{N}}\,a(l)\cos\left(\frac{(2i + 1)k\pi}{2M}\right)\cos\left(\frac{(2j + 1)l\pi}{2N}\right)$ where $a(k) = \begin{cases}\frac{1}{\sqrt{2}}, & \text{if } k = 0 \\ 1, & \text{otherwise}\end{cases}$

[0045] The 2-D DCT and IDCT are separable and may be factored as follows: $x_{ij} = \sum\limits_{k = 0}^{M - 1} z_{kj}\,e_{i,M}(k)$

[0046] for i=0, 1, . . . , M−1 and j=0, 1, . . . , N−1

[0047] where the temporal 1-D IDCT data are: $z_{kj} = \sum\limits_{l = 0}^{N - 1} y_{kl}\,e_{j,N}(l)$

[0048] for k=0, 1, . . . , M−1 and j=0, 1, . . . , N−1

[0049] and the DCT basis vectors $e_{i,M}(k)$ are: $e_{i,M}(k) = \sqrt{\frac{2}{M}}\,a(k)\cos\left(\frac{(2i + 1)k\pi}{2M}\right)$

[0050] for i,k=0, 1, . . . , M−1

[0051] A fast algorithm for calculating the IDCT (Chen) capitalizes on the cyclic property of the transform basis function (the cosine function). For example, for an eight-point IDCT, the basis function only assumes 8 different positive and negative values as shown in the following table:

j/l    0      1      2      3      4      5      6      7
 0    c(0)   c(1)   c(2)   c(3)   c(4)   c(5)   c(6)   c(7)
 1    c(0)   c(3)   c(6)  −c(7)  −c(4)  −c(1)  −c(2)  −c(5)
 2    c(0)   c(5)  −c(6)  −c(1)  −c(4)   c(7)   c(2)   c(3)
 3    c(0)   c(7)  −c(2)  −c(5)   c(4)   c(3)  −c(6)  −c(1)
 4    c(0)  −c(7)  −c(2)   c(5)   c(4)  −c(3)  −c(6)   c(1)
 5    c(0)  −c(5)  −c(6)   c(1)  −c(4)  −c(7)   c(2)  −c(3)
 6    c(0)  −c(3)   c(6)   c(7)  −c(4)   c(1)  −c(2)   c(5)
 7    c(0)  −c(1)   c(2)  −c(3)   c(4)  −c(5)   c(6)  −c(7)

[0052] where c(m) denotes the following basis terms: $c(m) = a(m)\cos\left(\frac{m\pi}{16}\right) = \left\{\cos\left(\frac{\pi}{4}\right), \cos\left(\frac{\pi}{16}\right), \cos\left(\frac{\pi}{8}\right), \cos\left(\frac{3\pi}{16}\right), \cos\left(\frac{\pi}{4}\right), \cos\left(\frac{5\pi}{16}\right), \cos\left(\frac{3\pi}{8}\right), \cos\left(\frac{7\pi}{16}\right)\right\} = \left\{\cos\left(\frac{\pi}{4}\right), \cos\left(\frac{\pi}{16}\right), \cos\left(\frac{\pi}{8}\right), \cos\left(\frac{3\pi}{16}\right), \cos\left(\frac{\pi}{4}\right), \sin\left(\frac{3\pi}{16}\right), \sin\left(\frac{\pi}{8}\right), \sin\left(\frac{\pi}{16}\right)\right\}$
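The sign and index pattern of the eight-point table can be reproduced from the cosine symmetries alone, as in the following Python sketch. The function name `basis_entry` is an assumption made for illustration, and the sketch tracks only the cosine angle; the normalization a(0) that is folded into c(0) is not modeled, so the first column simply reports index 0.

```python
import math


def basis_entry(j, l):
    """Return (sign, m) with cos((2j+1)*l*pi/16) == sign * cos(m*pi/16) and 0 <= m <= 7."""
    t = ((2 * j + 1) * l) % 32   # angle in units of pi/16, reduced modulo 2*pi
    if t > 16:
        t = 32 - t               # cos(2*pi - x) = cos(x)
    if t > 8:
        return -1, 16 - t        # cos(pi - x) = -cos(x)
    return 1, t


# Rows are indexed by j and columns by l, matching the table layout.
table = [[basis_entry(j, l) for l in range(8)] for j in range(8)]

# Largest deviation between the folded form and the direct cosine value.
max_err = max(
    abs(s * math.cos(m * math.pi / 16) - math.cos((2 * j + 1) * l * math.pi / 16))
    for j in range(8) for l in range(8) for (s, m) in [basis_entry(j, l)]
)
```

Each entry `(sign, m)` corresponds to a table cell of the form ±c(m), which is why only eight distinct magnitudes ever need to be stored.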

[0053] The cyclical nature of the IDCT shown in the above table provides the following relationship between output terms of the 1-D IDCT:

$\frac{x_{i} + x_{7 - i}}{2} = e_{i}(0)y_{0} + e_{i}(2)y_{2} + e_{i}(4)y_{4} + e_{i}(6)y_{6}$

$\frac{x_{i} - x_{7 - i}}{2} = e_{i}(1)y_{1} + e_{i}(3)y_{3} + e_{i}(5)y_{5} + e_{i}(7)y_{7}$

[0054] where the basis terms $e_{i}(k)$ have sign and value mapped to the DCT basis terms c(m) according to the relationship: $e_{i}(k) = \pm\frac{1}{2}c\left(m(i,k)\right)$

[0055] For a 4-point IDCT, the basis terms also have the symmetrical property illustrated in the above table, as follows:

j/l    0      1      2      3
 0    c(0)   c(2)   c(4)   c(6)
 1    c(0)   c(6)  −c(4)  −c(2)
 2    c(0)  −c(6)  −c(4)   c(2)
 3    c(0)  −c(2)   c(4)  −c(6)

[0056] The corresponding equations are:

$\frac{x_{i} + x_{3 - i}}{2} = e_{i}(0)y_{0} + e_{i}(4)y_{2}$

$\frac{x_{i} - x_{3 - i}}{2} = e_{i}(2)y_{1} + e_{i}(6)y_{3}$

[0057] Based upon the above derivation, a 1-D 8-point IDCT can be represented by the following matrix vector equations:

$\begin{bmatrix}x_{0} \\ x_{1} \\ x_{2} \\ x_{3}\end{bmatrix} = \frac{1}{2}A\begin{bmatrix}y_{0} \\ y_{4} \\ y_{2} \\ y_{6}\end{bmatrix} + \frac{1}{2}B\begin{bmatrix}y_{1} \\ y_{5} \\ y_{3} \\ y_{7}\end{bmatrix}$

$\begin{bmatrix}x_{7} \\ x_{6} \\ x_{5} \\ x_{4}\end{bmatrix} = \frac{1}{2}A\begin{bmatrix}y_{0} \\ y_{4} \\ y_{2} \\ y_{6}\end{bmatrix} - \frac{1}{2}B\begin{bmatrix}y_{1} \\ y_{5} \\ y_{3} \\ y_{7}\end{bmatrix}$

where:

$A = \begin{bmatrix}c(0) & c(4) & c(2) & c(6) \\ c(0) & -c(4) & c(6) & -c(2) \\ c(0) & -c(4) & -c(6) & c(2) \\ c(0) & c(4) & -c(2) & -c(6)\end{bmatrix}$

$B = \begin{bmatrix}c(1) & c(5) & c(3) & c(7) \\ c(3) & -c(1) & -c(7) & -c(5) \\ c(5) & c(7) & -c(1) & c(3) \\ c(7) & c(3) & -c(5) & -c(1)\end{bmatrix}$

[0058] and $c(0) = \cos\left(\frac{\pi}{4}\right)$

[0059] and $c(n) = \cos\left(\frac{n\pi}{16}\right)$ $(n = 1, 2, 3, 4, 5, 6, 7)$

[0060] Note that $A^{-1} = \frac{1}{2}A^{T}$ and $B^{-1} = \frac{1}{2}B^{T}$
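The stated inverses are equivalent to $AA^{T} = 2I$ and $BB^{T} = 2I$, which may be checked numerically with the short Python sketch below. The helper name `times_transpose` is introduced here for illustration; the matrices are entered exactly as defined above.

```python
import math

# c(0) = cos(pi/4) and c(n) = cos(n*pi/16) for n = 1..7, as defined in the text.
c = [math.cos(math.pi / 4)] + [math.cos(n * math.pi / 16) for n in range(1, 8)]

A = [[c[0],  c[4],  c[2],  c[6]],
     [c[0], -c[4],  c[6], -c[2]],
     [c[0], -c[4], -c[6],  c[2]],
     [c[0],  c[4], -c[2], -c[6]]]

B = [[c[1],  c[5],  c[3],  c[7]],
     [c[3], -c[1], -c[7], -c[5]],
     [c[5],  c[7], -c[1],  c[3]],
     [c[7],  c[3], -c[5], -c[1]]]


def times_transpose(m):
    """Return the product m * m^T for a square matrix held as nested lists."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in m] for r1 in m]


aat = times_transpose(A)  # expected to be close to 2 * identity
bbt = times_transpose(B)  # expected to be close to 2 * identity
```

Since each product is twice the identity up to rounding, multiplying by half the transpose inverts either matrix, so the forward transform can reuse the same coefficient set.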

[0061] Using the paradigm depicted in FIG. 5, a common factor may be factored from the matrix equation above such that certain coefficients are unity. The unity coefficients then allow for the introduction of a number of MAAC kernels in a computational architecture, thereby reducing the number of clock cycles required to carry out the IDCT. In particular, by factoring $c(0) = c(4) = \frac{1}{\sqrt{2}}$

[0062] out from the matrix vector equation above, the following equations are obtained:

$\begin{bmatrix}x_{0} \\ x_{1} \\ x_{2} \\ x_{3}\end{bmatrix} = \frac{1}{2}A'\begin{bmatrix}y_{0} \\ y_{4} \\ y_{2} \\ y_{6}\end{bmatrix} + \frac{1}{2}B'\begin{bmatrix}y_{1} \\ y_{5} \\ y_{3} \\ y_{7}\end{bmatrix}$

$\begin{bmatrix}x_{7} \\ x_{6} \\ x_{5} \\ x_{4}\end{bmatrix} = \frac{1}{2}A'\begin{bmatrix}y_{0} \\ y_{4} \\ y_{2} \\ y_{6}\end{bmatrix} - \frac{1}{2}B'\begin{bmatrix}y_{1} \\ y_{5} \\ y_{3} \\ y_{7}\end{bmatrix}$

where:

$A' = \begin{bmatrix}1 & 1 & c'(2) & c'(6) \\ 1 & -1 & c'(6) & -c'(2) \\ 1 & -1 & -c'(6) & c'(2) \\ 1 & 1 & -c'(2) & -c'(6)\end{bmatrix}$

$B' = \begin{bmatrix}c'(1) & c'(5) & c'(3) & c'(7) \\ c'(3) & -c'(1) & -c'(7) & -c'(5) \\ c'(5) & c'(7) & -c'(1) & c'(3) \\ c'(7) & c'(3) & -c'(5) & -c'(1)\end{bmatrix}$

[0063] Because the factor $\frac{1}{\sqrt{2}}$

[0064] is factored out of the matrix vector equation, the results after two-dimensional operations would carry a scale factor of two. Dividing the final result by 2 after the two-dimensional computation would result in the correct transform.
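The factor of two may be traced as follows, under the assumption (implied by the unity entries of $A'$ but not stated explicitly) that $c'(n) = c(n)/c(0) = \sqrt{2}\,c(n)$, and writing $\tilde{x}$ for the unscaled output, a symbol introduced here for illustration only:

$A = \frac{1}{\sqrt{2}}A', \quad B = \frac{1}{\sqrt{2}}B'$

$\tilde{x}_{\text{1-D}} = \frac{1}{2}\left(A'y_{\text{even}} \pm B'y_{\text{odd}}\right) = \sqrt{2}\cdot\frac{1}{2}\left(Ay_{\text{even}} \pm By_{\text{odd}}\right) = \sqrt{2}\,x_{\text{1-D}}$

$\tilde{x}_{\text{2-D}} = \sqrt{2}\cdot\sqrt{2}\,x_{\text{2-D}} = 2\,x_{\text{2-D}}$

Each 1-D pass thus omits one factor of $\frac{1}{\sqrt{2}}$, and the two omitted factors combine into a single division by 2 after the second pass.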

[0065] Note that the expression for the IDCT derived above incorporates multiple instances of the generalized expression $d = \sum\limits_{i = 0}^{N - 1} a(i)*b(i)$

[0066] re-expressed as $d = c\left(\sum\limits_{i = 0}^{M - 1} a(i)*b'(i) + \sum\limits_{i = M}^{N - 1} a(i)\right)$

[0067] where $\{b'(i) = 1 : M \leq i \leq N - 1\}$, to which the present invention is addressed. This is a consequence of the nature of matrix multiplication and may be seen as follows (unpacking the matrix multiplication):

[0068] x₀=y₀+y₄+c′(2)*y₂+c′(6)*y₆+c′(1)*y₁+c′(5)*y₅+c′(3)*y₃+c′(7)*y₇

[0069] x₁=y₀−y₄+c′(6)*y₂−c′(2)*y₆+c′(3)*y₁−c′(1)*y₅−c′(7)*y₃−c′(5)*y₇

[0070] x₂=y₀−y₄−c′(6)*y₂+c′(2)*y₆+c′(5)*y₁+c′(7)*y₅−c′(1)*y₃+c′(3)*y₇

[0071] x₃=y₀+y₄−c′(2)*y₂−c′(6)*y₆+c′(7)*y₁+c′(3)*y₅−c′(5)*y₃−c′(1)*y₇

[0072] x₇=y₀+y₄+c′(2)*y₂+c′(6)*y₆−c′(1)*y₁−c′(5)*y₅−c′(3)*y₃−c′(7)*y₇

[0073] x₆=y₀−y₄+c′(6)*y₂−c′(2)*y₆−c′(3)*y₁+c′(1)*y₅+c′(7)*y₃+c′(5)*y₇

[0074] x₅=y₀−y₄−c′(6)*y₂+c′(2)*y₆−c′(5)*y₁−c′(7)*y₅+c′(1)*y₃−c′(3)*y₇

[0075] x₄=y₀+y₄−c′(2)*y₂−c′(6)*y₆−c′(7)*y₁−c′(3)*y₅+c′(5)*y₃+c′(1)*y₇

[0076] Note that the above expressions do not incorporate the scale factors ½, which can be applied at the end of the calculation simply as a right bit-shift.
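The unpacked sums may be checked against the cosine definition with the Python sketch below. It is a sketch under stated assumptions: the primed coefficients are taken to be $c'(n) = c(n)/c(0)$, the sums are built directly from the rows of $A'$ and $B'$, and the names `idct8_unscaled` and `idct8_reference` are hypothetical.

```python
import math

c = [math.cos(math.pi / 4)] + [math.cos(n * math.pi / 16) for n in range(1, 8)]
cp = [v / c[0] for v in c]  # assumed c'(n) = c(n)/c(0), so cp[0] == cp[4] == 1

A_P = [[1,  1,  cp[2],  cp[6]],
       [1, -1,  cp[6], -cp[2]],
       [1, -1, -cp[6],  cp[2]],
       [1,  1, -cp[2], -cp[6]]]

B_P = [[cp[1],  cp[5],  cp[3],  cp[7]],
       [cp[3], -cp[1], -cp[7], -cp[5]],
       [cp[5],  cp[7], -cp[1],  cp[3]],
       [cp[7],  cp[3], -cp[5], -cp[1]]]


def idct8_unscaled(y):
    """Return x0..x7 from the A'/B' sums, omitting the 1/2 and 1/sqrt(2) scale factors."""
    even = [y[0], y[4], y[2], y[6]]
    odd = [y[1], y[5], y[3], y[7]]
    e = [sum(m * v for m, v in zip(row, even)) for row in A_P]
    o = [sum(m * v for m, v in zip(row, odd)) for row in B_P]
    top = [e[i] + o[i] for i in range(4)]     # x0, x1, x2, x3
    bottom = [e[i] - o[i] for i in range(4)]  # x7, x6, x5, x4
    return top + bottom[::-1]                 # reorder to x0 ... x7


def idct8_reference(y):
    """Orthonormal 8-point IDCT evaluated directly from the cosine definition."""
    return [sum(0.5 * (1 / math.sqrt(2) if k == 0 else 1.0) * y[k]
                * math.cos((2 * i + 1) * k * math.pi / 16) for k in range(8))
            for i in range(8)]
```

Under these assumptions, dividing the unscaled outputs by $2\sqrt{2}$ reproduces the reference transform, which accounts for both the omitted ½ and the factored-out $\frac{1}{\sqrt{2}}$.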

[0077] FIG. 6 is a block diagram of a hardware architecture for computing an eight-point IDCT utilizing a MAAC kernel according to one embodiment of the present invention. The hardware architecture of FIG. 6 may be incorporated into a larger datapath for computation of an IDCT. As shown in FIG. 6, data loader 505 is coupled to four dual MAAC kernels 405(1)-405(4), each dual MAAC kernel including two MAAC kernels sharing a common multiplier. Note that the architecture depicted in FIG. 6 is merely illustrative and is not intended to limit the scope of the claims appended hereto. The operation of the hardware architecture depicted in FIG. 6 for computing the IDCT will become evident with respect to the following discussion.

[0078] Utilizing the architecture depicted in FIG. 6, a 1-D 8-point IDCT can be computed in 5 clock cycles as follows:

[0079] 1st Clock

[0080] mult1=c′(1)*y₁

[0081] mult2=c′(3)*y₁

[0082] mult3=c′(5)*y₁

[0083] mult4=c′(7)*y₁

[0084] x₀(clk1)=y₀+mult1+0

[0085] x₇(clk1)=y₀−mult1+0

[0086] x₁(clk1)=y₀+mult2+0

[0087] x₆(clk1)=y₀−mult2+0

[0088] x₂(clk1)=y₀+mult3+0

[0089] x₅(clk1)=y₀−mult3+0

[0090] x₃(clk1)=y₀+mult4+0

[0091] x₄(clk1)=y₀−mult4+0

[0092] 2nd Clock

[0093] mult1=c′(5)*y₅

[0094] mult2=−c′(1)*y₅

[0095] mult3=c′(7)*y₅

[0096] mult4=c′(3)*y₅

[0097] x₀(clk2)=y₄+mult1+x₀(clk1)

[0098] x₇(clk2)=y₄−mult1+x₇(clk1)

[0099] x₁(clk2)=−y₄+mult2+x₁(clk1)

[0100] x₆(clk2)=−y₄−mult2+x₆(clk1)

[0101] x₂(clk2)=−y₄+mult3+x₂(clk1)

[0102] x₅(clk2)=−y₄−mult3+x₅(clk1)

[0103] x₃(clk2)=y₄+mult4+x₃(clk1)

[0104] x₄(clk2)=y₄−mult4+x₄(clk1)

[0105] 3rd Clock

[0106] mult1=c′(3)*y₃

[0107] mult2=−c′(7)*y₃

[0108] mult3=−c′(1)*y₃

[0109] mult4=−c′(5)*y₃

[0110] x₀(clk3)=0+mult1+x₀(clk2)

[0111] x₇(clk3)=0−mult1+x₇(clk2)

[0112] x₁(clk3)=0+mult2+x₁(clk2)

[0113] x₆(clk3)=0−mult2+x₆(clk2)

[0114] x₂(clk3)=0+mult3+x₂(clk2)

[0115] x₅(clk3)=0−mult3+x₅(clk2)

[0116] x₃(clk3)=0+mult4+x₃(clk2)

[0117] x₄(clk3)=0−mult4+x₄(clk2)

[0118] 4th Clock

[0119] mult1=c′(7)*y₇

[0120] mult2=−c′(5)*y₇

[0121] mult3=c′(3)*y₇

[0122] mult4=−c′(1)*y₇

[0123] x₀(clk4)=0+mult1+x₀(clk3)

[0124] x₇(clk4)=0−mult1+x₇(clk3)

[0125] x₁(clk4)=0+mult2+x₁(clk3)

[0126] x₆(clk4)=0−mult2+x₆(clk3)

[0127] x₂(clk4)=0+mult3+x₂(clk3)

[0128] x₅(clk4)=0−mult3+x₅(clk3)

[0129] x₃(clk4)=0+mult4+x₃(clk3)

[0130] x₄(clk4)=0−mult4+x₄(clk3)

[0131] 5th Clock

[0132] mult1=c′(2)*y₂

[0133] mult2=c′(6)*y₆

[0134] mult3=c′(6)*y₂

[0135] mult4=−c′(2)*y₆

[0136] x₀(clk5)=mult2+mult1+x₀(clk4)

[0137] x₇(clk5)=mult2+mult1+x₇(clk4)

[0138] x₁(clk5)=mult3+mult4+x₁(clk4)

[0139] x₆(clk5)=mult3+mult4+x₆(clk4)

[0140] x₂(clk5)=−mult3−mult4+x₂(clk4)

[0141] x₅(clk5)=−mult3−mult4+x₅(clk4)

[0142] x₃(clk5)=−mult1−mult2+x₃(clk4)

[0143] x₄(clk5)=−mult1−mult2+x₄(clk4)
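A five-cycle accumulation of this kind may be simulated in Python as sketched below. The sketch rests on several assumptions: the primed coefficients are taken as $c'(n) = c(n)/c(0)$, the per-cycle products and signs are read column by column from $A'$ and $B'$ rather than from any figure, and the function name `idct8_five_cycles` is hypothetical. Each of the four shared multipliers feeds one accumulator pair.

```python
import math

# Assumed c'(n) = c(n)/c(0); index 0 is the unity entry c'(0).
cp = [1.0] + [math.cos(n * math.pi / 16) / math.cos(math.pi / 4) for n in range(1, 8)]


def idct8_five_cycles(y):
    """Accumulate the unscaled 8-point IDCT in five cycles on four dual accumulators."""
    acc_top = [0.0] * 4  # x0, x1, x2, x3
    acc_bot = [0.0] * 4  # x7, x6, x5, x4
    # Cycles 1-4: (odd input, column of B', unity-coefficient addend per row).
    schedule = [
        (y[1], [cp[1],  cp[3],  cp[5],  cp[7]], [y[0], y[0], y[0], y[0]]),
        (y[5], [cp[5], -cp[1],  cp[7],  cp[3]], [y[4], -y[4], -y[4], y[4]]),
        (y[3], [cp[3], -cp[7], -cp[1], -cp[5]], [0.0, 0.0, 0.0, 0.0]),
        (y[7], [cp[7], -cp[5],  cp[3], -cp[1]], [0.0, 0.0, 0.0, 0.0]),
    ]
    for y_in, coeffs, addends in schedule:
        for r in range(4):
            mult = coeffs[r] * y_in           # shared multiplier of dual kernel r
            acc_top[r] += addends[r] + mult   # x_r receives +product
            acc_bot[r] += addends[r] - mult   # x_(7-r) receives -product
    # Cycle 5: four products of y2 and y6; each accumulator adds two of them.
    p = [cp[2] * y[2], cp[6] * y[6], cp[6] * y[2], -cp[2] * y[6]]
    even = [p[0] + p[1], p[2] + p[3], -p[2] - p[3], -p[0] - p[1]]
    for r in range(4):
        acc_top[r] += even[r]
        acc_bot[r] += even[r]
    return acc_top + acc_bot[::-1]            # x0 ... x7, still unscaled
```

The returned values omit the ½ and $\frac{1}{\sqrt{2}}$ scale factors, so they exceed the orthonormal IDCT by a factor of $2\sqrt{2}$; a plain MAC arrangement with the same four multipliers would need eight cycles for the same eight input terms.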

[0144] FIG. 7 is a block diagram of a datapath for computation of an 8-point IDCT utilizing the method of the present invention and a number of MAAC kernel components according to one embodiment of the present invention. Note that the datapath shown in FIG. 7 includes four dual MAAC kernels 405(1)-405(4).

[0145] According to an alternative embodiment, the MAAC kernel is modified to include two additional additions, to produce a structure herein referred to as the AMAAC kernel. The AMAAC kernel can be described by the following recursive equation:

$d^{[i+1]} = d^{[i]} + [a(i) + e(i)]*b(i) + c(i)$ with initial value $d^{[0]} = 0$.

[0146] FIG. 8 is a block diagram illustrating the operation of an AMAAC kernel according to one embodiment of the present invention. The AMAAC kernel, as described below with reference to FIG. 9, provides a structure for achieving more efficient downsampling computations. AMAAC kernel 805 includes multiplier 310, first adder 320a, second adder 320b and register 330. First adder 320a adds a(i) and e(i). Multiplier 310 performs multiplication of input datum [a(i)+e(i)] and filter coefficient b(i), the result of which is passed to adder 320b. Adder 320b adds the result of multiplier 310 to a second input term c(i) along with accumulated output $d^{[i]}$, which was previously stored in register 330. The output of adder 320b ($d^{[i+1]}$) is then stored in register 330.

[0147] As two more additions are performed during the same AMAAC cycle, the AMAAC kernel has a higher performance throughput for some classes of computations. For example, a digital filter in which some filter coefficients have equal values can be sped up by the AMAAC kernel. Specifically, a(i), c(i), and e(i) can be considered as input data and b(i) as filter coefficients. With inputs a(i) and e(i) having the same filter coefficients b(i) and inputs c(i) having unity coefficients, all three groups of inputs can be processed in parallel.
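The parallel handling of the three input groups may be illustrated with the Python sketch below. The function name `amaac_accumulate` and the symmetric six-tap filter are hypothetical and chosen only to show equal-coefficient inputs sharing one multiplication while unity-coefficient inputs use the c(i) port.

```python
def amaac_accumulate(a, e, b, c):
    """Run d[i+1] = d[i] + (a(i) + e(i)) * b(i) + c(i) from d[0] = 0."""
    d = 0
    for a_i, e_i, b_i, c_i in zip(a, e, b, c):
        d = d + (a_i + e_i) * b_i + c_i  # pre-add, multiply, then two additions
    return d


# Hypothetical symmetric taps [3, 1, 5, 5, 1, 3] applied to inputs [2, 7, 4, 6, 9, 8].
# Inputs sharing tap 3 (2 and 8) and tap 5 (4 and 6) are pre-added on the a/e ports;
# the unity-tap inputs (7 and 9) are supplied on the c port.
d = amaac_accumulate(a=[2, 4], e=[8, 6], b=[3, 5], c=[7, 9])
```

In this example six filter terms are absorbed in two AMAAC cycles, whereas a MAC kernel would take six cycles and a MAAC kernel four.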

[0148] The present invention may be applied to downsampling or filtering operations in general, not necessarily involving the IDCT. For example, the filtering of finite digital signals in the sample domain may be performed using convolution. A well-known circular convolution may be obtained, for example, by generating a periodic extension of the signal then applying a filter by performing a circular convolution on the periodically extended signal and an appropriate filter. This may be efficiently performed in the DFT domain, for example, by obtaining a simple multiplication of the DFT coefficients of the signal and the DFT coefficients of the filter and then applying the inverse DFT to the result. For the DCT, a convolution may be applied that is related to, but different from, the DFT convolution. This is described, for example, in “Symmetric Convolution and the Discrete Sine and Cosine Transforms,” by S. Martucci, IEEE Transactions on Signal Processing, Vol. 42, No. 5, May 1994, and includes a symmetric extension of the signal and filter, linear convolution, and applying a window to the result.

[0149] For example, consider a 2-D signal and a 2-D filter. Assume that the 2-D signal is represented as $x_{i,j}$ with DCT coefficients $y_{k,l}$, where i and k range from 0 to M−1 and j and l range from 0 to N−1, and that the 2-D filter is represented as $h_{p,q}$, where p ranges from 0 to P−1 and q ranges from 0 to Q−1. According to this example, filter $h_{p,q}$ may be a symmetric, low pass, even-length filter with filter lengths P and Q, where P=2M and Q=2N:

$h_{p,q} = h_{2M-1-p,\,q} = h_{p,\,2N-1-q} = h_{2M-1-p,\,2N-1-q}$ for p=0,1, . . . ,M−1 and q=0,1, . . . ,N−1.

[0150] The DCT (frequency domain) coefficients H_(k,l) for the filterh_(p,q) may be obtained by applying a 2-D DCT to the fourth quadrant ofthe filter: $\begin{matrix}{H_{k1} = \quad {\sqrt{\frac{2}{M}}{a(k)}\sqrt{\frac{2}{N}}{a(l)}{\sum\limits_{i = 0}^{M - 1}{\sum\limits_{j = 0}^{N - 1}h_{{M + i},{N + j}}}}}} \\{\quad {{\cos \left( \frac{\left( {{2i} + 1} \right)k\quad \pi}{2M} \right)}{\cos \left( \frac{\left( {{2j} + 1} \right)l\quad \pi}{2N} \right)}}}\end{matrix}$

[0151] for k=0, 1, . . . , M−1 and l=0, 1, . . . , N−1. Then the filtering with respect to a particular sample in the inverse transform domain is performed by element-by-element multiplication of the signal DCT coefficients, y_(k,l), and the filter DCT coefficients, H_(k,l), and taking the appropriate inverse DCT transform of the DCT-domain multiplication results:

Y_(k,l) = H_(k,l) · y_(k,l) for k=0,1, . . . ,M−1 and l=0,1, . . . ,N−1.

[0152] Downsampling may be performed in the DCT domain. For example, downsampling by two (2:1) in the horizontal direction may be performed on a signal that has been filtered by element-by-element multiplication, for example, according to the relationship: $Y'_{k,l} = \frac{1}{\sqrt{2}}\left(Y_{k,l} - Y_{k,N-1}\right)$

[0153] for k=0,1, . . . , M−1 and l=0,1, . . . , N/2−1.

[0154] Similarly, 2:1 downsampling in the vertical direction can also be performed in the DCT domain.

[0155] The decimated signal is then obtained by applying the inverse DCT transform of length N/2 to Y′_(k,l). There are several special cases that might be usefully applied in this embodiment, although the invention is not limited in scope in this respect. For example, a brickwall filter with coefficients [1 1 1 1 0 0 0 0] in the DCT domain may be implemented that can further simplify the 8-point DCT domain downsampling by two operation. Specifically, the special filter shape avoids folding and addition. Another filter with coefficients [1 1 1 1 0.5 0 0 0] provides the transfer function of an antialiasing filter for the 2:1 operation. Other filters may also be employed, of course.
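The brickwall special case may be sketched in Python as follows. The sketch assumes an orthonormal DCT-II/IDCT pair, which the text does not specify, and the helper names are hypothetical. Because the upper four filtered coefficients are zero, the folding term vanishes and only the 1/√2 scaling remains before the length-N/2 inverse transform.

```python
import math

def dct(x):
    """Orthonormal DCT-II (assumed normalization)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for i in range(n)))
    return out

def idct(y):
    """Inverse of the orthonormal DCT-II above."""
    n = len(y)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * y[k]
                * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for k in range(n))
            for i in range(n)]

BRICKWALL = [1, 1, 1, 1, 0, 0, 0, 0]  # filter coefficients from the text

def downsample_2to1(samples):
    """8 samples in, 4 samples out, entirely via DCT-domain operations."""
    filtered = [h * c for h, c in zip(BRICKWALL, dct(samples))]
    half = len(filtered) // 2
    # The upper half is zero, so no folding or addition is needed here.
    kept = [c / math.sqrt(2) for c in filtered[:half]]
    return idct(kept)

flat = downsample_2to1([3.0] * 8)
ramp = downsample_2to1([float(i) for i in range(8)])
print(flat)
print(ramp)
```

Under the assumed normalization a constant input stays constant and the mean of the ramp is preserved, which gives a quick sanity check on the scaling.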

[0156] In order to map such a filtered downsampling operation to an AMAAC computation kernel, the element-by-element multiplication operation in the DCT domain can be incorporated in the Inverse Quantization (IQ) block. Specifically, the filter DCT coefficients, H_(k,l), can be combined together with the inverse quantization coefficients. Subsequently, the output of the IQ block is the filtered DCT coefficients, Y_(k,l), of the signal.

[0157] FIG. 9 is a block diagram of a reconfigurable downsampling computation engine according to one embodiment of the present invention. As shown in FIG. 9, a plurality of adders and multipliers 910 are configured via switching fabric 905 to operate in either a non-downsampling mode 920 or a downsampling mode 930.

[0158] According to one embodiment, to operate in non-downsampling mode 920, adders and multipliers 910 are configured via switching fabric 905 as a plurality of MAAC kernels 405(1)-405(N). To operate in downsampling mode 930, adders and multipliers 910 are configured via switching fabric 905 as a plurality of MAAC kernels 405(1)-405(N) and a plurality of AMAAC kernels 805(1)-805(N).

[0159] According to one embodiment, MAAC and AMAAC computational kernels may be combined to generate a reconfigurable computation engine (for example, to compute the IDCT). By allowing this reconfiguration, hardware logic gates can be shared to improve performance without incurring additional cost.
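A behavioral sketch of the two kernels and of the mode selection is given below in Python. The update rules are inferred from the kernel structures recited in the claims (multiplier, adder, and feedback register for the MAAC; an additional input adder ahead of the multiplier for the AMAAC). The class names, the `build_engine` helper, and the kernel counts are hypothetical and chosen only for illustration.

```python
class MAAC:
    """Multiply-add-accumulate model: acc <- acc + a*b + c."""

    def __init__(self):
        self.acc = 0.0  # models the register block

    def step(self, a, b, c=0.0):
        self.acc += a * b + c
        return self.acc


class AMAAC:
    """Add-multiply-add-accumulate model: acc <- acc + (e + a)*b + c."""

    def __init__(self):
        self.acc = 0.0  # models the register block

    def step(self, e, a, b, c=0.0):
        self.acc += (e + a) * b + c
        return self.acc


def build_engine(downsampling, n_maac=4, n_amaac=1):
    """Stand-in for the switching fabric: select which kernels exist."""
    kernels = {"maac": [MAAC() for _ in range(n_maac)]}
    if downsampling:
        kernels["amaac"] = [AMAAC() for _ in range(n_amaac)]
    return kernels


maac = MAAC()
maac.step(2.0, 3.0, 1.0)           # 2*3 + 1
maac_total = maac.step(1.0, 4.0)   # accumulates 1*4 on top

amaac = AMAAC()
amaac_total = amaac.step(1.0, 2.0, 3.0, 4.0)  # (1 + 2)*3 + 4

print(maac_total, amaac_total)
print(sorted(build_engine(downsampling=True)))
```

The model captures only the arithmetic of each kernel; sharing of physical adders and multipliers between modes is not represented.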

[0160] For example, a typical algorithm for computing a 1-D IDCT with 2:1 downsampling is expressed as follows: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = c(4) * A' \begin{bmatrix} y_{0} - y_{7} \\ y_{2} - y_{5} \\ y_{1} - y_{6} \\ y_{3} - y_{4} \end{bmatrix}$ where $A' = \begin{bmatrix} 1 & 1 & c'(2) & c'(6) \\ 1 & -1 & c'(6) & -c'(2) \\ 1 & -1 & -c'(6) & c'(2) \\ 1 & 1 & -c'(2) & -c'(6) \end{bmatrix}$

[0161] Note that for the downsampling operation, compared with a non-downsampling 1-D IDCT, addition is applied to the input data {y0, y1, . . . , y7} first, followed by multiplication with the coefficients c′. Since in the first path the input DCT coefficients arrive in a zig-zag order, for a given column the 1-D DCT coefficients arrive serially but interleaved with varying numbers of coefficients in other columns. If the downsampling operation is directly implemented using conventional hardware, there will be a significant number of idle cycles (bubbles) because of the random order of the arriving y coefficients. This may result in pipeline stalls in a conventional hardware architecture.

[0162] According to one embodiment, a reconfigurable hardware architecture is realized by performing multiplication operations on the input y terms arriving serially first, followed by additions. This ordering may be realized upon examination of the downsampling operation in expanded form as follows: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = c(4) \begin{bmatrix} y_{0} - y_{7} + y_{2} - y_{5} + c'(2)*y_{1} - c'(2)*y_{6} + c'(6)*y_{3} - c'(6)*y_{4} \\ y_{0} - y_{7} - y_{2} + y_{5} + c'(6)*y_{1} - c'(6)*y_{6} - c'(2)*y_{3} + c'(2)*y_{4} \\ y_{0} - y_{7} - y_{2} + y_{5} - c'(6)*y_{1} + c'(6)*y_{6} + c'(2)*y_{3} - c'(2)*y_{4} \\ y_{0} - y_{7} + y_{2} - y_{5} - c'(2)*y_{1} + c'(2)*y_{6} - c'(6)*y_{3} + c'(6)*y_{4} \end{bmatrix}$

[0163] In the expanded vector equation above, note that multiplication operations are applied to the input y terms arriving serially first, followed by additions. According to one embodiment, in 2:1 downsampling mode, the higher order coefficients (y5, y6 and y7) may be zeroed without causing significant degradation to the output video quality. This is a result of the energy compaction property of the DCT.
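The equivalence of the add-first matrix form and the multiply-first expanded form can be checked numerically with the Python sketch below. It assumes c(4) = cos(π/4) and c′(k) = cos(kπ/16)/c(4); these definitions are not restated in this section and are used here as assumptions. The sample coefficient column and function names are hypothetical.

```python
import math

C4 = math.cos(math.pi / 4)  # assumed value of c(4)

def cp(k):
    """Assumed definition of c'(k): cos(k*pi/16) / c(4)."""
    return math.cos(k * math.pi / 16) / C4

A_PRIME = [
    [1,  1,  cp(2),  cp(6)],
    [1, -1,  cp(6), -cp(2)],
    [1, -1, -cp(6),  cp(2)],
    [1,  1, -cp(2), -cp(6)],
]

def add_then_multiply(y):
    """Matrix form: fold the inputs first, then apply A'."""
    folded = [y[0] - y[7], y[2] - y[5], y[1] - y[6], y[3] - y[4]]
    return [C4 * sum(a * f for a, f in zip(row, folded)) for row in A_PRIME]

def multiply_then_add(y):
    """Expanded form: each y term is scaled individually, then summed."""
    c2, c6 = cp(2), cp(6)
    return [
        C4 * (y[0] - y[7] + y[2] - y[5] + c2*y[1] - c2*y[6] + c6*y[3] - c6*y[4]),
        C4 * (y[0] - y[7] - y[2] + y[5] + c6*y[1] - c6*y[6] - c2*y[3] + c2*y[4]),
        C4 * (y[0] - y[7] - y[2] + y[5] - c6*y[1] + c6*y[6] + c2*y[3] - c2*y[4]),
        C4 * (y[0] - y[7] + y[2] - y[5] - c2*y[1] + c2*y[6] - c6*y[3] + c6*y[4]),
    ]

# Hypothetical coefficient column with small high-order terms.
column = [100.0, 40.0, -20.0, 10.0, 5.0, 2.0, -1.0, 0.5]
truncated = column[:5] + [0.0, 0.0, 0.0]  # y5, y6 and y7 zeroed

full = multiply_then_add(column)
approx = multiply_then_add(truncated)
print(full)
print(approx)
```

Comparing `full` and `approx` shows how much the output changes when the three highest-order coefficients of this particular column are dropped; the size of that change depends on the input data.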

[0164] Zeroing the higher order coefficients, the following expression is obtained: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = c(4) * A' \begin{bmatrix} y_{0} \\ y_{2} \\ y_{1} \\ y_{3} - y_{4} \end{bmatrix}$

[0165] By expanding the above matrix equation, the following relationship is obtained:

[0166] x₀=y₀+y₂+c′(2)*y₁+c′(6)*(y₃−y₄)

[0167] x₁=y₀−y₂+c′(6)*y₁+(−c′(2))*(y₃−y₄)

[0168] x₂=y₀−y₂+(−c′(6))*y₁+c′(2)*(y₃−y₄)

[0169] x₃=y₀+y₂+(−c′(2))*y₁+(−c′(6))*(y₃−y₄)

[0170] According to one embodiment of the present invention, the above equations are realized according to a hardware embodiment depicted in FIG. 10. Referring to FIG. 10, in the downsampling mode, adders and multipliers are configured as MAAC kernels 405(1) (multiplier 310(2) and adder 320(6)) and 405(2) (multiplier 310(2) and adder 320(4)) and an AMAAC kernel 805 (adders 320(1), 320(3) and multiplier 310(4)). Comparing the downsampling configuration shown in FIG. 10 with the non-downsampling configuration shown in FIG. 6, note that the four multipliers and eight adders utilized in the non-downsampling mode are also shown in the downsampling mode operation. However, note that four of the adders are utilized as shared adders, specifically shared adder 320(5) computing y₀−y₂, shared adder 320(2) computing y₀+y₂ and shared adder 320(1) computing y₃−y₄. Note that adder 320(7) is not utilized in the downsampling configuration.

[0171] FIG. 11a is a block diagram illustrating a datapath for computing a first path of an eight-point 2-D IDCT in a non-downsampling mode (also called the 1h1v mode, as the scaling ratios along both the horizontal (h) direction and the vertical (v) direction are 1:1) according to one embodiment of the present invention. The hardware architecture shown in FIG. 11a improves IDCT computation by simultaneously processing addition and multiply terms.

[0172] As derived above, an 8-point IDCT may be expressed as follows: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = A' \begin{bmatrix} y_{0} \\ y_{4} \\ y_{2} \\ y_{6} \end{bmatrix} + B' \begin{bmatrix} y_{1} \\ y_{5} \\ y_{3} \\ y_{7} \end{bmatrix}$ $\begin{bmatrix} x_{7} \\ x_{6} \\ x_{5} \\ x_{4} \end{bmatrix} = A' \begin{bmatrix} y_{0} \\ y_{4} \\ y_{2} \\ y_{6} \end{bmatrix} - B' \begin{bmatrix} y_{1} \\ y_{5} \\ y_{3} \\ y_{7} \end{bmatrix}$ $A' = \begin{bmatrix} 1 & 1 & c'(2) & c'(6) \\ 1 & -1 & c'(6) & -c'(2) \\ 1 & -1 & -c'(6) & c'(2) \\ 1 & 1 & -c'(2) & -c'(6) \end{bmatrix}$ $B' = \begin{bmatrix} c'(1) & c'(5) & c'(3) & c'(7) \\ c'(3) & -c'(1) & -c'(7) & -c'(5) \\ c'(5) & c'(7) & -c'(1) & c'(3) \\ c'(7) & c'(3) & -c'(5) & -c'(1) \end{bmatrix}$
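As a numerical check of the factored form, the following Python sketch compares it against a direct evaluation of an orthonormal 8-point IDCT. It assumes c(4) = cos(π/4), c′(k) = cos(kπ/16)/c(4), and an overall scale of ½*c(4) applied to the matrix result; the function names and test vector are hypothetical.

```python
import math

C4 = math.cos(math.pi / 4)  # assumed value of c(4)

def cp(k):
    """Assumed definition of c'(k): cos(k*pi/16) / c(4)."""
    return math.cos(k * math.pi / 16) / C4

A_PRIME = [
    [1,  1,  cp(2),  cp(6)],
    [1, -1,  cp(6), -cp(2)],
    [1, -1, -cp(6),  cp(2)],
    [1,  1, -cp(2), -cp(6)],
]
B_PRIME = [
    [cp(1),  cp(5),  cp(3),  cp(7)],
    [cp(3), -cp(1), -cp(7), -cp(5)],
    [cp(5),  cp(7), -cp(1),  cp(3)],
    [cp(7),  cp(3), -cp(5), -cp(1)],
]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def idct8_factored(y):
    """x0..x3 = A'ye + B'yo and x7..x4 = A'ye - B'yo, scaled by c(4)/2."""
    even = matvec(A_PRIME, [y[0], y[4], y[2], y[6]])
    odd = matvec(B_PRIME, [y[1], y[5], y[3], y[7]])
    scale = 0.5 * C4
    top = [scale * (e + o) for e, o in zip(even, odd)]      # x0, x1, x2, x3
    bottom = [scale * (e - o) for e, o in zip(even, odd)]   # x7, x6, x5, x4
    return top + bottom[::-1]

def idct8_reference(y):
    """Direct orthonormal 8-point IDCT, used only as a cross-check."""
    return [0.5 * (C4 * y[0] + sum(
                y[k] * math.cos((2 * n + 1) * k * math.pi / 16)
                for k in range(1, 8)))
            for n in range(8)]

coeffs = [120.0, -30.0, 14.0, 7.5, -3.0, 2.0, 1.0, -0.5]
print(idct8_factored(coeffs))
print(idct8_reference(coeffs))
```

Under the stated assumptions the two outputs should agree to within rounding, since the odd-indexed cosine terms change sign between sample n and sample 7−n while the even-indexed terms do not.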

[0173] As shown in FIG. 11a, the first path of the 1h1v mode is configured to include four dual MAAC kernels 405(1)-405(4). Note that this configuration corresponds directly to FIG. 6. The four dual MAAC kernels 405(1)-405(4) allow simultaneous processing of addition and multiply terms. According to the above equation, IQ block 130 generates coefficients that must be processed either by performing a multiplication or an addition. The utilization of the MAAC kernels in the datapath shown in FIG. 11a (see FIG. 6) allows simultaneous performance of the multiplication and addition, improving performance.

[0174] A portion of the architecture shown in FIG. 11a is responsible for demultiplexing addition and multiply terms received from IQ block 130. In particular, coefficients from IQ block 130 (not shown) are received through an IQ interface, where they are demultiplexed onto node 1125 (addition term) or node 1130 (multiply term) depending upon whether the coefficient is a multiply or addition term.

[0175] Multiply terms are then passed from node 1130 to one of multipliers 310(a)-310(d) via combinational logic. Similarly, addition terms are passed from node 1125 to one of adders 320(a)-320(h). In particular, in the 1h1v first path, y0 and y4 are addition terms while y1, y2, y3, y5, y6 and y7 are multiply terms. Note that these terms may be utilized immediately upon generation from IQ block 130. That is, a bubble is not introduced into the pipeline while waiting for coefficients from IQ block 130.
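The following Python sketch illustrates why serially arriving coefficients can be consumed immediately: each coefficient is routed as an addition term (y0, y4) or a multiply term (all others) and accumulated into the eight outputs, so the result does not depend on arrival order. The weights assume c(4) = cos(π/4) and c′(k) = cos(kπ/16)/c(4). The arrival orders, names, and the omission of the final scaling are illustrative assumptions, and the sharing of the four physical multipliers is not modeled.

```python
import math

C4 = math.cos(math.pi / 4)   # assumed value of c(4)
ADD_TERMS = {0, 4}           # coefficients whose weights are +1 or -1

def weight(k, n):
    """Weight of y_k in output x_n, before the final scaling."""
    if k == 0:
        return 1.0
    if k == 4:
        return 1.0 if n in (0, 3, 4, 7) else -1.0
    return math.cos((2 * n + 1) * k * math.pi / 16) / C4

def first_path(arrivals):
    """Accumulate (k, y_k) pairs in whatever order they arrive."""
    acc = [0.0] * 8                      # one accumulator per output sample
    routed = {"add": [], "mult": []}     # models the two demultiplexer nodes
    for k, value in arrivals:
        node = "add" if k in ADD_TERMS else "mult"
        routed[node].append(k)
        for n in range(8):
            # For add terms the weight is only a sign, so no multiplier
            # would be needed in hardware.
            acc[n] += weight(k, n) * value
    return acc, routed

y = [120.0, -30.0, 14.0, 7.5, -3.0, 2.0, 1.0, -0.5]
in_order = list(enumerate(y))
scrambled = [(k, y[k]) for k in (0, 1, 4, 2, 5, 7, 3, 6)]  # hypothetical order

acc_a, routed_a = first_path(in_order)
acc_b, routed_b = first_path(scrambled)
print(acc_a)
print(acc_b)
print(routed_b)
```

Because every coefficient contributes through an independent accumulate step, no arrival has to wait for a partner term, which is the behavior the text describes as avoiding pipeline bubbles.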

[0176] The intermediate output terms of the first path of the IDCT are stored in transport storage unit 1105 (TRAM), where they await processing by the second path of the IDCT computation.

[0177] FIG. 11b is a block diagram illustrating a datapath for computing a second path of an eight-point 2-D IDCT in a non-downsampling mode according to one embodiment of the present invention. Note that the operative equation for the second path is identical to that presented above for the first path (see FIG. 11a and accompanying text). The configuration for the second 1h1v path corresponds directly to FIG. 6. Thus, similar to the first path, four dual MAAC kernels 405(1)-405(4) allow simultaneous processing of addition and multiply terms. According to the above equation, IQ block 130 generates coefficients that must be processed either by performing a multiplication or an addition. The utilization of the MAAC kernels in the datapath shown in FIG. 11b (see FIG. 6) allows simultaneous performance of the multiplication and addition, improving performance.

[0178] With respect to the data flow, the only difference is that the y terms in the equation originate from TRAM 1105 and the calculated x terms are output to motion compensation block 150. Note that upon completion of the 1h1v 2nd path, three right shifts are performed due to the initial factoring of ½*c(4) in the first and second paths. Thus, at the end, three right shifts (½*c(4)*½*c(4)=½*½*½) are required in order to obtain the final correct 2-D IDCT results.
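The shift count can be made explicit with a short worked equation, assuming c(4) = cos(π/4) = 1/√2 (a value consistent with the products stated in the text but not restated in this paragraph):

$\left(\tfrac{1}{2}\,c(4)\right)\left(\tfrac{1}{2}\,c(4)\right) = \tfrac{1}{4}\,c(4)^{2} = \tfrac{1}{4}\cdot\tfrac{1}{2} = \tfrac{1}{8} = 2^{-3}$

A scale factor of 2^(−3) corresponds to three right shifts of a fixed-point result.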

[0179] FIG. 12a is a block diagram illustrating a datapath for computing a first path of an eight-point to four-point 2-D IDCT in a downsampling mode (also called the 2h2v mode, as the scaling ratios along both the horizontal (h) direction and the vertical (v) direction are 2:1) according to one embodiment of the present invention. Recall, as derived above, computation of a 2:1 downsampled eight-point IDCT may be expressed as follows: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = \begin{bmatrix} y_{0} - y_{7} + y_{2} - y_{5} + c'(2)*y_{1} - c'(2)*y_{6} + c'(6)*y_{3} - c'(6)*y_{4} \\ y_{0} - y_{7} - y_{2} + y_{5} + c'(6)*y_{1} - c'(6)*y_{6} - c'(2)*y_{3} + c'(2)*y_{4} \\ y_{0} - y_{7} - y_{2} + y_{5} - c'(6)*y_{1} + c'(6)*y_{6} + c'(2)*y_{3} - c'(2)*y_{4} \\ y_{0} - y_{7} + y_{2} - y_{5} - c'(2)*y_{1} + c'(2)*y_{6} - c'(6)*y_{3} + c'(6)*y_{4} \end{bmatrix}$

[0180] As shown in FIG. 12a, the first path of the 2h2v mode is configured to include MAAC kernels 405(1)-405(4). Note that this configuration corresponds directly to FIG. 6. The use of four MAAC kernels 405(1)-405(4) allows simultaneous processing of addition and multiply terms. According to the above equation, IQ block 130 generates coefficients that must be processed either by performing a multiplication or an addition. The utilization of the MAAC kernels 405(1)-405(4) in the datapath shown in FIG. 12a (see FIG. 6) allows simultaneous performance of the multiplication and addition, improving performance.

[0181] In particular, in the 2h2v first path, y0, y7, y2 and y5 are involved in an addition operation and y1, y3, y4 and y6 are involved in a multiply operation. Using the above architecture, it will take 4 clocks for 1-column coefficients to finish the first path of a 2-D 2h2v IDCT. For an 8×8 block with all non-zero coefficients, it will take 4×8=32 clocks to finish the 1-D IDCT. Because it is 2h2v mode, and 2:1 downsampling is performed in the vertical direction, only four terms are output per operation. Thus, although the architecture includes 4 multipliers and 8 adders in the block diagram of FIG. 12a, only 4 multipliers and 4 adders are actually involved in the computation in the first path.

[0182] FIG. 12b is a block diagram illustrating a datapath for computing a second path of an eight-point to four-point 2-D IDCT in a downsampling mode according to one embodiment of the present invention. In this path one AMAAC kernel 805 is utilized along with three additional addition computations. Note that the AMAAC kernel 805 utilizes adder 320(a) in a shared configuration. In the 2h2v 2nd path, 7 adders and 4 multipliers are utilized.

[0183] Also note that in the 2nd path of 2h2v, all the y inputs (y0, y1, . . . , y4) originate from TRAM 1105, and thus all terms are available simultaneously. Further, note that the equation cited above for the first path 2h2v is also operative for the second path. By rewriting the above matrix equation as

x0=y0+y2+c′(2)*y1+c′(6)*(y3−y4)

x1=y0−y2+c′(6)*y1−c′(2)*(y3−y4)

x2=y0−y2−c′(6)*y1+c′(2)*(y3−y4)

x3=y0+y2−c′(2)*y1−c′(6)*(y3−y4)

[0184] it can be seen that the other adders 320(a)-320(c) may be used to calculate:

add1=y3−y4

add2=y0−y2

add3=y0+y2

[0185] and 4 multipliers 310(a)-310(d) can be used to calculate:

mult1=c′(2)*(y3−y4)=c′(2)*add1

mult2=c′(6)*y1

mult3=c′(2)*y1

mult4=(−c′(6))*(y3−y4)=(−c′(6))*add1

[0186] In the final stage, the following state is obtained:

x0=add3+mult3+(−mult4)

x1=add2+mult2+(−mult1)

x2=add2+(−mult2)+(mult1)

x3=add3+(−mult3)+(mult4)
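The staged computation can be cross-checked against the four output equations with the Python sketch below. It takes mult3 as c′(2)*y1 and each x_i from the corresponding row of A′, and assumes c′(k) = cos(kπ/16)/cos(π/4); the function names and the sample inputs are hypothetical.

```python
import math

C4 = math.cos(math.pi / 4)  # assumed value of c(4)

def cp(k):
    """Assumed definition of c'(k): cos(k*pi/16) / c(4)."""
    return math.cos(k * math.pi / 16) / C4

def second_path_staged(y0, y1, y2, y3, y4):
    """Three adder results, four multiplier results, then a final sum."""
    add1 = y3 - y4
    add2 = y0 - y2
    add3 = y0 + y2
    mult1 = cp(2) * add1
    mult2 = cp(6) * y1
    mult3 = cp(2) * y1
    mult4 = -cp(6) * add1
    return [
        add3 + mult3 - mult4,   # x0
        add2 + mult2 - mult1,   # x1
        add2 - mult2 + mult1,   # x2
        add3 - mult3 + mult4,   # x3
    ]

def second_path_direct(y0, y1, y2, y3, y4):
    """Rows of A' applied to (y0, y2, y1, y3 - y4)."""
    c2, c6 = cp(2), cp(6)
    d = y3 - y4
    return [
        y0 + y2 + c2 * y1 + c6 * d,
        y0 - y2 + c6 * y1 - c2 * d,
        y0 - y2 - c6 * y1 + c2 * d,
        y0 + y2 - c2 * y1 - c6 * d,
    ]

inputs = (100.0, 40.0, -20.0, 10.0, 5.0)  # hypothetical y0..y4 from TRAM
staged = second_path_staged(*inputs)
direct = second_path_direct(*inputs)
print(staged)
print(direct)
```

Because the three sums and four products have no dependencies on one another beyond add1, they can all be evaluated in parallel before the final summation.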

[0187] Thus, for the second path in 2h2v mode, each 1-D row IDCT can be completed in just one cycle. This improved throughput in the IDCT stage matches very well with the improved throughput in motion compensation unit 150 that follows.

[0188] At the completion of the 2h2v 2nd path, a right shift is necessary because c(4) was factored out in both the first path and the second path, and therefore the calculation c(4)*c(4)=½ must be performed at the end.

[0189] According to one embodiment, the MAAC and AMAAC operations may be incorporated into the instruction set of a CPU or DSP processor that incorporates one or more MAAC and/or AMAAC kernels. This would allow compilation of a source program to directly take advantage of these hardware structures for signal processing operations.

What is claimed is:
1. A reconfigurable hardware apparatus for performing computational operations in one of a downsampling mode and a non-downsampling mode, comprising: a plurality of adders, each of the plurality of adders including at least two inputs and one output; a plurality of multipliers, each of the plurality of multipliers including at least two inputs and one output; a switching fabric for switching between a downsampling mode of operation and a non-downsampling mode of operation, wherein the switching fabric provides for a configuration of the inputs and outputs of the adders with respect to the inputs and outputs of the multipliers; and, a control logic block for controlling the switching fabric.
2. The hardware apparatus according to claim 1, wherein in the non-downsampling mode, the switching apparatus configures the multipliers and adders to include a plurality of MAAC kernels.
3. The hardware apparatus according to claim 1, wherein in the downsampling mode, the switching apparatus configures the multipliers and adders to include a plurality of MAAC kernels and at least one AMAAC kernel.
4. The hardware apparatus according to claim 2, wherein the MAAC kernel includes a multiplier block, an adder block and a register block, wherein an output of the multiplier block is coupled to an input of the adder block, an output of the adder block is coupled to an input of the register block and an output of the register block is coupled to a second input of the adder block and the adder block receives at its second input an additional addend.
5. The hardware apparatus according to claim 3, wherein the AMAAC kernel includes a multiplier block, a first adder block, a second adder block and a register block, wherein the first adder block receives two inputs (e(i) and a(i)) and an output of the first adder block is coupled to a first input of the multiplier block, the multiplier block receiving a second input (b(i)) and an output of the multiplier block coupled to a first input of the second adder block, the second adder block receiving a second input (c(i)) and an output of the second adder block is coupled to an input of the register block, an output of the register block coupled to a third input of the second adder block.
6. The hardware apparatus according to claim 1, wherein the computational operations include transformations.
7. The hardware apparatus according to claim 6, wherein the transformations include an inverse Discrete Cosine Transform (IDCT).
8. The hardware apparatus according to claim 7, wherein in the non-downsampling mode, an eight-point IDCT is computed utilizing the following expression: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = \frac{1}{2} A' \begin{bmatrix} y_{0} \\ y_{4} \\ y_{2} \\ y_{6} \end{bmatrix} + \frac{1}{2} B' \begin{bmatrix} y_{1} \\ y_{5} \\ y_{3} \\ y_{7} \end{bmatrix}$ $\begin{bmatrix} x_{7} \\ x_{6} \\ x_{5} \\ x_{4} \end{bmatrix} = \frac{1}{2} A' \begin{bmatrix} y_{0} \\ y_{4} \\ y_{2} \\ y_{6} \end{bmatrix} - \frac{1}{2} B' \begin{bmatrix} y_{1} \\ y_{5} \\ y_{3} \\ y_{7} \end{bmatrix}$ where: $A' = \begin{bmatrix} 1 & 1 & c'(2) & c'(6) \\ 1 & -1 & c'(6) & -c'(2) \\ 1 & -1 & -c'(6) & c'(2) \\ 1 & 1 & -c'(2) & -c'(6) \end{bmatrix}$ $B' = \begin{bmatrix} c'(1) & c'(5) & c'(3) & c'(7) \\ c'(3) & -c'(1) & -c'(7) & -c'(5) \\ c'(5) & c'(7) & -c'(1) & c'(3) \\ c'(7) & c'(3) & -c'(5) & -c'(1) \end{bmatrix}$


9. The hardware apparatus according to claim 7, wherein in the downsampling mode, a 2:1 downsampling of an eight-point IDCT is computed utilizing the following expression: $\begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = c(4) \begin{bmatrix} y_{0} - y_{7} + y_{2} - y_{5} + c'(2)*y_{1} - c'(2)*y_{6} + c'(6)*y_{3} - c'(6)*y_{4} \\ y_{0} - y_{7} - y_{2} + y_{5} + c'(6)*y_{1} - c'(6)*y_{6} - c'(2)*y_{3} + c'(2)*y_{4} \\ y_{0} - y_{7} - y_{2} + y_{5} - c'(6)*y_{1} + c'(6)*y_{6} + c'(2)*y_{3} - c'(2)*y_{4} \\ y_{0} - y_{7} + y_{2} - y_{5} - c'(2)*y_{1} + c'(2)*y_{6} - c'(6)*y_{3} + c'(6)*y_{4} \end{bmatrix}$


10. A hardware apparatus for computing an IDCT in one of a downsampling mode and a non-downsampling mode comprising: a data loader block; a plurality of MAAC kernels; and at least one AMAAC kernel.
11. The hardware apparatus according to claim 10, wherein each MAAC kernel includes a multiplier block, an adder block and a register block, wherein an output of the multiplier block is coupled to an input of the adder block, an output of the adder block is coupled to an input of the register block and an output of the register block is coupled to a second input of the adder block and the adder block receives at its second input an additional addend.
12. The hardware apparatus according to claim 10, wherein each AMAAC kernel includes a multiplier block, a first adder block, a second adder block and a register block, wherein the first adder block receives two inputs (e(i) and a(i)) and an output of the first adder block is coupled to a first input of the multiplier block, the multiplier block receiving a second input (b(i)) and an output of the multiplier block coupled to a first input of the second adder block, the second adder block receiving a second input (c(i)) and an output of the second adder block is coupled to an input of the register block, an output of the register block coupled to a third input of the second adder block.
13. A system for performing downsampling computations: at least one AMAAC kernel, wherein each AMAAC kernel includes: a multiplier block; a first adder block; a second adder block; a register block; wherein the first adder block receives two inputs (e(i) and a(i)) and an output of the first adder block is coupled to a first input of the multiplier block, the multiplier block receiving a second input (b(i)) and an output of the multiplier block coupled to a first input of the second adder block, the second adder block receiving a second input (c(i)) and an output of the second adder block is coupled to an input of the register block, an output of the register block coupled to a third input of the second adder block; at least one MAAC kernel, wherein each MAAC kernel includes: a multiplier block; an adder block; a register block; wherein an output of the multiplier block is coupled to a first input of the adder block, an output of the adder block is coupled to an input of the register block and an output of the register block is coupled to a second input of the adder block, the adder block receiving a third input.
14. The system according to claim 13, wherein the downsampling computations are performed as part of a video decoding system.
15. The system according to claim 13, wherein the downsampling operations are performed in conjunction with an IDCT process.
16. A digital signal processor wherein the digital signal processor includes: at least one MAAC kernel; and, at least one AMAAC kernel.
17. The digital signal processor according to claim 16, wherein the digital signal processor includes instructions for controlling the at least one MAAC kernel.
18. The digital signal processor according to claim 16, wherein the digital signal processor includes instructions for controlling the at least one AMAAC kernel.